\section{Our technique}

\subsection{Basic technique}
\label{sec-basic-technique}

To illustrate our technique, we first show a very simple version of it
in the form of the following code:%
\footnote{Throughout this paper, we rely on the left-to-right
  evaluation order mandated by the \commonlisp{} standard.}

{\small
\begin{verbatim}
(defun count-from-end (x list)
  (labels ((aux (x list n)
             (cond ((= n 0) 0)
                   ((= n 1)
                    (if (eq x (car list)) 1 0))
                   (t (let* ((n/2 (ash n -1))
                             (half (nthcdr n/2 list)))
                        (+ (aux x half (- n n/2))
                           (aux x list n/2)))))))
    (aux x list (length list))))))
\end{verbatim}
}

This function starts by computing the length of the list and then
calling the auxiliary function with the original arguments and the
length.  The auxiliary function calls \texttt{nthcdr} in order to get
a reference to about half the list it was passed.  Then it makes two
recursive calls, first with the second half of the list and then with
the first half of the list.  The recursion terminates when the list has
a single element or no element in it.  When it has no element in it,
clearly the count is $0$.  When it has a single element in it, the
element is compared to the argument \texttt{x} and if they are the
same, the value $1$ is returned, otherwise $0$ is returned.

The main feature of our technique is that it trades fewer recursive
calls for multiple traversals of the list.  The maximum number of
simultaneous active invocations of this simple function is around
$\mathsf{lb}\thinspace n$, where $n$ is the length of the list.  The
maximum value of this number is quite modest.  On a 64-bit processor,
it can never exceed $60$ and it is significantly smaller in practice
of course.  
% not relevant for count but for find
% The number of times this function computes the
% \texttt{cdr} of a list depends on where in the list the item to be
% found is located.  If it is the \emph{last} element of the list (best
% case), each \texttt{cons} cell is processed twice; once to compute the
% length of the list, and once again as part of the recursive traversal.
% When the item to be found is the \emph{first} element of the list
% (worst case), 
The number of \texttt{cdr} operations can be
approximately expressed as $n\thinspace (1 +
\frac{1}{2}\mathsf{lb}\thinspace n)$.  In \refSec{sec-analyses} we
analyze this result in greater depth.

The best case for this function is very efficient indeed.
The worst case is unacceptably slow.  Even for a list of some
reasonable length such as a million elements, the execution time is a
factor $6$ slower than for the best case.

The remainder of this section is dedicated to ways of improving on the
performance of the basic technique.

\subsection{Using more stack space}
\label{sec-more-stack}

By far the most important improvement to the basic technique is to
take advantage of the available stack space to decrease the number of
multiple list traversals required by the basic technique.

The following example illustrates this technique by using the simple
recursive traversal if there are fewer than $10000$ elements in the
list.%
\footnote{The number $10000$ was chosen to be a significant part of a
  typical per-thread default stack while still leaving room for stack
  space required by callers and callees of this function.  In a real
  production implementation, the number would be chosen based on the
  remaining space left on the stack when the function is called.}
If there are more elements, then it divides the list in two,
just like the basic technique shown in \refSec{sec-basic-technique}.

{\small
\begin{verbatim}
(defun count-from-end-2 (x list)
  (labels ((recursive (x list n)
             (if (zerop n)
                 0
                 (+ (recursive x (cdr list) (1- n))
                    (if (eq x (car list)) 1 0))))
           (aux (x list n)
             (if (<= n 10000)
                 (recursive x list n)
                 (let* ((n/2 (ash n -1))
                        (half (nthcdr n/2 list)))
                   (+
                    (aux x half (- n n/2))
                    (aux x list n/2))))))
    (aux x list (length list))))
\end{verbatim}
}

With this improvement, the number of \texttt{cdr} operations required
can now be expressed as ap\-proximately 
\[
n\thinspace (1 +
\frac{1}{2}\mathsf{lb}\thinspace \frac{n}{10000})
\]
which is significantly better than the corresponding value for the
basic technique.

However, there is no particular reason to divide the list into $2$
equal-sized parts when there are too many elements for the basic
technique. \refSec{sec-benchmarks} gives a more complete explanation
of the parameters involved and how they influence the execution time
of the resulting code.

\subsection{Analyses}
\label{sec-analyses}

In this section we give approximate formulas for the performance of
our technique.
The basic measure we are interested in is the number
of \texttt{cdr} operations that must be performed as a function of the
number of elements of the list.  We will denote the number of elements
of the list by $N$ and the number of \texttt{cdr} operations required
by $F(N)$.  Since our technique always starts by traversing the entire
list in order to compute $N$, we can always write $F(N)$ as $N +
f(N)$, were $f(N)$ is the number of \texttt{cdr} operations required
in the subsequent step.

For the basic technique where the list is divided into two equal-size
sublists, we obtain the following recursive relation:

\label{analyse1}
\[ f(N) = \left\{ \begin{array}{ll}
                    0 & \mbox{if $N = 1$} \\
                    \left\lfloor\frac{N}{2}\right\rfloor
                    + f(\left\lfloor\frac{N}{2}\right\rfloor)
                    + f(\left\lceil\frac{N}{2}\right\rceil) &
                    \mbox{otherwise}
                  \end{array} \right. \]

In order to obtain an approximate solution to this relation, we can
solve for $N$ being a power of $2$, i.e., $N = 2^n$.  In that case,
for $N>1$ we obtain:

\[ f(N) = \frac{N}{2} + 2f(\frac{N}{2}) \]

The details of the approximate resolution of this recursive equation
is given in the appendix.
This solution yields

\[ f(N) = \frac{N}{2}\mathsf{lb}~N + Nf(1) = \frac{N}{2}\mathsf{lb}~N\]

Including the traversal to compute the number of elements of the list,
we obtain:

\[ F(N) = \frac{N}{2}\mathsf{lb}~N + N = N(1 + \frac{1}{2}\mathsf{lb}~N)\]

which is clearly $O(N\mathsf{log}~N)$.  More importantly, for a list
with around $16$ million elements (which fills the default heap of most
implementations we have tested), we have $N \approx 2^{24}$ which
gives $F(N) \approx 13N$ which is probably unacceptably slow.

Let us now consider what happens when we are able to handle more than
a single element with the basic recursive technique, as shown in
\refSec{sec-more-stack}.  We denote the number of elements that the
basic recursive technique can handle by $K$, and again, in order to
simplify the analysis, we assume that both $N$ and $K$ are powers of
$2$, i.e., $N = 2^n$ $K = 2^k$, and also that $N \ge K$.  The
recursion relation now looks as follows:

\label{analyse2}
\[ f(N) = \left\{ \begin{array}{ll}
                    N-1 & \mbox{if $N \le K$} \\
                    \frac{N}{2} + 2f(\frac{N}{2}) &\mbox{otherwise}
                  \end{array} \right. \]

The resolution of this equation is given in the appendix (Part B).
It yields:

\[ f(N) \approx N(1 + \frac{1}{2}\mathsf{lb}~\frac{N}{K})\]

With the best portable version of our technique and a typical stack
being able to handle $K = 2^{14}$ we are now looking at a 
performance for $N = 2^{24}$ of $F(N) \approx 6N$.  Comparing this
result to the technique of reversing the list, it is fair to say that
the overhead of allocating and subsequently garbage-collecting a
\texttt{cons} cell can very well be comparable to $6$ times the time
taken by the \texttt{cdr} operation.  In other words, the performance
of our portable version is already comparable to an implementation
based on first creating a reversed copy of the list and then
traversing that reversed copy.

Finally, instead of using more stack space for the base case, let us
analyze what happens if we divide the original list into more than two
parts.  For this analysis, let us assume that we divide the list into
$M$ equal parts, and that $M$ also is a power of $2$ so that $M =
2^m$.  We then obtain the following relation:

\label{analyse3}
\[ f(N) = \left\{ \begin{array}{ll}
                    0 & \mbox{if $N = 1$} \\
                    N - \frac{N}{M} + Mf(\frac{N}{M}) &\mbox{otherwise}
                  \end{array} \right. \]

The resolution of this equation is given in the appendix (Part C).
It yields:

\[ F(N) \approx N(1 + \frac{\mathsf{lb}~N}{\mathsf{lb}~M}) \]

While it may appear that we can get very good performance when $M$ is
chosen to be large, in practice, using large values of $M$ introduces
a different kind of overhead, namely large stack frames, making the
gain less than what the formula suggests.

\subsection{Implementation-specific solutions}

So far, we have explored techniques that can mostly be implemented in
portable \commonlisp{}.  In this section, we explore a variation on
our technique that requires access to the control stack of the
implementation.

Recall that at the lowest level of our technique, there is a recursive
function that is used for traversing the list when the number of
elements is small compared to the stack size.  At each invocation,
this function does very little work.

With direct access to the control stack, we can convert the recursive
function to an iterative function that pushes the elements of the list
on the control stack, and then processes them in reverse order.  This
technique has several advantages:

\begin{itemize}
\item A single word is used for each element, whereas the recursive
  function requires space for a return address, a frame pointer,
  saved registers, etc.  As a result, this technique can be used for
  lists with more elements than would be possible with the recursive
  technique, thereby further decreasing the number of times a list is
  traversed.
\item There is no function-call overhead involved.  The only
  processing that is needed for an element is to store it on the stack
  and then comparing it to the item.
\end{itemize}

We illustrate this technique in a notation similar to \commonlisp{}:

{\small
\begin{verbatim}
(defun low-level-reverse-count (item list length)
  (loop for rest = list then (cdr rest)
        repeat length
        do (push-on-stack (car rest)))
  (loop repeat length
        count (eq item (pop-from-stack))))
\end{verbatim}
}

We implemented this technique in \sbcl{}.  In order not to have to
recompile \sbcl{} with our additional function, we used the
implementation-specific foreign-function interface and wrote the
function using the language C.  Rather than pushing and popping the
control stack, we used the built-in C function \texttt{alloca} to
allocate a one-dimensional C array on the control stack to hold the
list elements.

In \sbcl{}, the default stack size is $2$MBytes, or around $250$k
words on a 64-bit processor.  We tested our technique using $100000$
words on the stack.  The result is that for a list with $10$ million
elements, our technique processes the list in reverse order as fast as
an ordinary loop from the beginning of the list.

This surprising result can be explained by a few factors:

\begin{itemize}
\item Presumably in order to speed up the functions \texttt{car} and
  \texttt{cdr}, SBCL uses the same tag for \texttt{cons} cells and for
  the symbol \texttt{nil}.  As a result, in order to traverse a list,
  SBCL must make \emph{two} tests for each element, namely one to
  check whether the putative list is something other than a list
  altogether, and another to check whether it is a \texttt{cons}
  cell.  When our technique traverses a list for which the number of
  elements is known, there is no need to make any additional tests,
  simply because when the length of the list is positive, the first
  element must be a \texttt{cons} cell.
\item The \sbcl{} compiler can not determine that the return value of
  \texttt{count} must always be a \texttt{fixnum}.%
  \footnote{On a byte-addressed processor where $n$ word-aligned bytes
    are needed to represent a \texttt{cons} cell, the number of
    elements in a list can be at most $N/n$ where $N$ is the maximum
    number of possible addresses.  In a system that uses at most
    $\mathsf{lb}~n$ tag bits for a fixnum, the value that
    \texttt{count} returns must be a fixnum.  While some systems might
    use $8$ tag bits, \sbcl{} on a $64$-bit platform uses a single tag
    bit for fixnums.  As a consequence, \texttt{count} must then
    return a fixnum.}
  When the function is implemented in C, this problem disappears.
\end{itemize}

If we put this technique in the perspective of the analyses in
\refSec{sec-analyses}, we can also see that the number of \texttt{cdr}
operations remains quite modest, even for lists with a very large
number of elements.

There are several variations on this implementation-specific
technique.  Some implementations might allocate a vector or a list
declared to be \texttt{dynamic-extent} on the stack, thus giving
essentially the same advantage as the version we implemented in C.
However, such a technique would still be implementation specific,
given that it is permitted for the compiler to ignore
\texttt{dynamic-extent} declarations.  In the case of \sbcl{}, using
such a declaration, we were able to obtain performance almost as good
as our C version.  However, as it turns out, \sbcl{} only allocates a
vector on the stack under certain circumstances thereby making this
technique impossible to apply in general.
